Data Collection & Warehousing Grade 12
Modern databases are fed by many automated data collection methods. Large organisations store historical data in warehouses for analysis and decision-making.
Methods of Data Collection
Data collection is the process of gathering raw facts and figures from many different sources so that they can be stored, processed and turned into useful information. In the past, a person sat with a clipboard and wrote everything down by hand. Today most data is collected automatically: every time you swipe a card, drive under an e-toll gantry or visit a website, data about you is captured without anyone writing a single thing.
A database is only as good as the data inside it. The more accurate, complete and up-to-date the collected data is, the better the decisions an organisation can make. Automated collection means more data, collected faster, with fewer human mistakes.
Example: When you scan your Pick n Pay Smart Shopper card at the till, the shop instantly records what you bought, when, where and for how much — that single swipe feeds a database that the shop later uses to send you targeted specials.
| Method | Description | Example |
|---|---|---|
| Forms (manual) | User types data directly into a form | School registration, job application |
| Web forms | Online interactive pages with GUI components (dropdowns, checkboxes) | Online survey, account sign-up |
| RFID tags | Wireless chips store ID data, read without contact | Library books, inventory, e-toll, access cards |
| Digital sensors | Gather environmental data automatically | Temperature sensors, motion detectors |
| Cookies | Small files stored by websites to record browsing behaviour and preferences | Login sessions, shopping carts, ad targeting |
| Transaction tracking | Records each purchase: date, time, location, product, amount | Credit card transactions, loyalty cards |
| Mobile apps | Collect app usage, location, device data | Uber location tracking, fitness app steps |
Static vs Dynamic Location Data
| Type | Description | Example |
|---|---|---|
| Static | Fixed, does not change once recorded | Address of a shop, GPS coordinates of a landmark |
| Dynamic | Changes as the object moves, updated continuously | Uber driver location, delivery truck tracking |
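The difference between the two types can be sketched in Python. This is only an illustration: the class names, the coordinates and the timestamps are made up for the example. The static location is recorded once and cannot be changed, while the dynamic location keeps a growing list of readings.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StaticLocation:
    """A fixed position that is recorded once, e.g. a shop address."""
    name: str
    latitude: float
    longitude: float


@dataclass
class DynamicLocation:
    """A moving object whose position is updated continuously."""
    name: str
    history: list = field(default_factory=list)  # (timestamp, lat, lon) tuples

    def update(self, timestamp: str, latitude: float, longitude: float) -> None:
        # Each new reading is appended, so the latest position keeps changing.
        self.history.append((timestamp, latitude, longitude))

    def current(self):
        # The most recent reading is the object's current position.
        return self.history[-1] if self.history else None


# Illustrative values only.
shop = StaticLocation("Example shop", -33.9249, 18.4241)
truck = DynamicLocation("Delivery truck 7")
truck.update("2024-05-01T08:00", -33.9300, 18.4200)
truck.update("2024-05-01T08:05", -33.9350, 18.4300)

print(shop)
print(truck.current())
```

Because `StaticLocation` is frozen, trying to change `shop.latitude` raises an error, which mirrors the idea that static data does not change once recorded.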
Data Warehousing
A data warehouse is a large central storage system that collects and consolidates historical data from many separate source databases into one single place, so that it can be analysed for trends and decision-making.
Think of it like this: a normal database is like the till at one Pick n Pay branch — it records today's sales as they happen. A data warehouse is like head office in Cape Town, where the sales from every branch in the country, going back many years, are all poured into one giant store. Head office does not use this to ring up a sale; they use it to spot the big picture — which products sell best in winter, which branches are growing, where to open the next store.
A data warehouse has four key characteristics:
- Central – one single store that brings many sources together.
- Historical – it keeps years of old data, not just today's transactions.
- Read-mostly – data is loaded in and then read for reports; it is not constantly changed.
- Subject-oriented – organised around topics like "sales" or "customers" for easy analysis.
Note: Because a warehouse holds so much data from so many sources, the data must first be cleaned and put into a consistent format before it is loaded in — otherwise the same customer might appear three different ways and the analysis would be wrong.
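A minimal Python sketch of this cleaning step is shown here. The customer name, the source systems and the `clean_name` function are all made up for the example; real cleaning usually handles many more cases (ID numbers, dates, phone formats).

```python
def clean_name(raw: str) -> str:
    """Put a customer name into one consistent format."""
    # Strip stray spaces, collapse repeated spaces, and use Title Case.
    return " ".join(raw.split()).title()


# The same (made-up) customer as she might appear in three source systems.
source_records = [
    {"source": "till",    "customer": "thandi  nkosi"},
    {"source": "website", "customer": "THANDI NKOSI "},
    {"source": "app",     "customer": " Thandi Nkosi"},
]

# Build cleaned copies of the records before loading them into the warehouse.
cleaned = [{**r, "customer": clean_name(r["customer"])} for r in source_records]

distinct_before = {r["customer"] for r in source_records}
distinct_after = {r["customer"] for r in cleaned}

print(len(distinct_before))  # 3 different spellings before cleaning
print(len(distinct_after))   # 1 customer after cleaning
```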
| Aspect | Database | Data Warehouse |
|---|---|---|
| Purpose | Day-to-day transactions | Historical analysis and reporting |
| Data | Current transactions only | Historical data from many sources |
| Operations | Read + write (INSERT, UPDATE, DELETE) | Mostly read (SELECT, reports) |
| Used by | Operational staff | Analysts, management, BI tools |
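The "Operations" row can be illustrated with Python's built-in `sqlite3` module. This is a simplified sketch: the `sales` table, its columns and all the figures are invented, and a real warehouse would be a separate system rather than the same in-memory database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (branch TEXT, sale_year INTEGER, product TEXT, amount REAL)"
)

# Operational style: many small writes as each sale happens.
rows = [
    ("Durban", 2022, "heater", 450.0),
    ("Durban", 2023, "heater", 500.0),
    ("Cape Town", 2022, "heater", 300.0),
    ("Cape Town", 2023, "fan", 250.0),
]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
conn.execute(
    "UPDATE sales SET amount = 480.0 WHERE branch = 'Durban' AND sale_year = 2022"
)

# Warehouse style: a read-only summary across all branches and years.
report = conn.execute(
    "SELECT sale_year, SUM(amount) FROM sales GROUP BY sale_year ORDER BY sale_year"
).fetchall()
conn.close()

print(report)  # [(2022, 780.0), (2023, 750.0)]
```

The INSERT and UPDATE statements are what operational staff trigger all day; the SELECT with GROUP BY is the kind of query an analyst would run for a report.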
Data Mining
Data mining is the process of analysing very large datasets to discover hidden patterns, trends and useful relationships that a person would be very unlikely to spot by eye. The name is a good clue: just as a gold miner digs through tonnes of rock to find a few grams of gold, data mining digs through millions of records to find the valuable nuggets of knowledge buried inside.
Why Do Companies Mine Data?
Companies collect enormous amounts of data, but raw data on its own has little value: it is just numbers and words. Data mining turns that raw data into insight they can act on. For example, Pick n Pay's Smart Shopper data might reveal that customers who buy nappies on a Friday evening also tend to buy snacks, so the shop places those products near each other and increases sales. Nobody told the computer to look for that link; the pattern was mined out of the data.
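One simple way to find such a link is to count how often each pair of products appears in the same basket. The sketch below uses five made-up baskets; real data mining tools work on millions of records and use more advanced measures, but the idea is the same.

```python
from collections import Counter
from itertools import combinations

# Made-up baskets; each set is one customer's purchase.
baskets = [
    {"nappies", "snacks", "milk"},
    {"nappies", "snacks"},
    {"bread", "milk"},
    {"nappies", "snacks", "bread"},
    {"milk", "snacks"},
]

pair_counts = Counter()
for basket in baskets:
    # Sort so that each pair is always counted under the same key.
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# The pair that was bought together most often.
top_pair, top_count = pair_counts.most_common(1)[0]
print(top_pair, top_count)  # ('nappies', 'snacks') 3
```

The program was never told which products to compare; it counted every pair and the strongest one stood out.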
Process
- Extract relevant data — use SQL to pull only the required subset
- Look for patterns — analyse for trends, correlations, anomalies
- Discover knowledge — turn patterns into actionable insights
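The three steps above can be sketched with Python and its built-in `sqlite3` module. The table, the sales figures and the "more than double the average" rule are assumptions chosen for the example, not a standard method.

```python
import sqlite3
from statistics import mean

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (month TEXT, product TEXT, units INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("2024-01", "heater", 10), ("2024-02", "heater", 12),
        ("2024-03", "heater", 11), ("2024-06", "heater", 60),
        ("2024-06", "fan", 5),
    ],
)

# Step 1: extract only the relevant subset with SQL.
heater_sales = conn.execute(
    "SELECT month, units FROM sales WHERE product = ? ORDER BY month",
    ("heater",),
).fetchall()
conn.close()

# Step 2: look for a pattern, here months far above the average.
average = mean(units for _, units in heater_sales)
spikes = [month for month, units in heater_sales if units > 2 * average]

# Step 3: turn the pattern into something a manager can act on.
for month in spikes:
    print(f"Heater sales spiked in {month}: order extra stock before that month.")
```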
Applications
- Marketing: target campaigns based on customer age, interests, behaviour
- Healthcare: predict patient risk factors from historical data
- Government: determine social grant thresholds
- Social media: Facebook mines your likes to suggest friends and content
Caring for Data
Caring for data means protecting the accuracy, security and availability of an organisation's data throughout its life. Data is one of the most valuable assets a company owns — often worth more than its buildings or equipment — because decisions, money and reputations all depend on it. Poor data leads to poor decisions, and lost data can shut a business down completely.
"Garbage in, garbage out." If incorrect data goes into the system, only incorrect information can come out — no matter how clever your analysis is. This is why validation and verification at the point of capture are so important.
Below are the main ways an organisation looks after its data. Notice how each one protects against a different danger — wrong data, lost data, or stolen data.
- Validation — ensure entered data meets rules (format, range, type)
- Verification — ensure data is correct and intentional (double entry, confirmation)
- Logging — audit trail recording who changed what and when
- Access control — passwords, user rights, encryption
- Parallel data sets — duplicate copies in separate locations for disaster recovery
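The first three measures in the list above can be combined in a short Python sketch. Everything here is hypothetical: the rule that a mark must be a whole number from 0 to 100, the function names and the user name are chosen only to illustrate validation, double-entry verification and logging.

```python
from datetime import datetime

audit_log = []  # Logging: a simple in-memory audit trail


def validate_mark(text: str) -> bool:
    """Validation: the value must be a whole number from 0 to 100."""
    return text.isdecimal() and 0 <= int(text) <= 100


def verify_double_entry(first: str, second: str) -> bool:
    """Verification: the same value must be typed twice."""
    return first == second


def capture_mark(user: str, first: str, second: str):
    """Return the mark as an int if it passes both checks, else None."""
    if not validate_mark(first) or not verify_double_entry(first, second):
        return None
    # Record who captured what and when.
    audit_log.append((datetime.now().isoformat(), user, f"captured mark {first}"))
    return int(first)


print(capture_mark("teacher1", "78", "78"))    # 78
print(capture_mark("teacher1", "780", "780"))  # None (fails the range check)
print(capture_mark("teacher1", "78", "87"))    # None (the two entries differ)
print(len(audit_log))                          # 1
```

Note that validation only checks that the data is sensible, not that it is true: a mark of 78 passes the range check even if the learner actually scored 87, which is why verification is a separate step.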