Knowing where your files live, how data is encoded, what different file types mean, and how to prepare clean data for AI tools: these are the skills that separate productive AI use from frustrating trial and error.
Core Concept
Every piece of data on a computer is stored as a file: a named container holding data (text, images, code, video). Files are organized into folders (also called directories), which can contain other folders, creating a tree-like hierarchy. This entire structure is managed by the file system (NTFS on Windows, APFS on Mac).
Understanding this hierarchy means you always know where to find your work and can communicate clearly about file locations, which is critical when sharing documents with colleagues or troubleshooting with IT.
Think of a filing cabinet. The cabinet itself is your hard drive. Drawers are top-level folders. Within each drawer are hanging folders (sub-folders). Inside those are individual documents (files). A file path is like the address: "Cabinet A → Drawer: Finance → Folder: Q3 2024 → Document: Budget.xlsx".
Almost every AI interaction involves a file. Uploading a PDF for Claude to analyze, pointing a model at a CSV of sales data, saving AI-generated code to the right folder: all of this requires knowing your file paths, extensions, and data types. But knowing where your files are is only half the story. The other half is making sure the data inside those files is clean, correctly typed, and in a format the AI can actually parse. This module covers both: the file system fundamentals and the data preparation skills that make AI tools work reliably.
Decoding a File Path
A file path is the complete address of a file on your computer. Each segment has a specific meaning.
WINDOWS PATH:
C:\Users\JohnSmith\Documents\Acme_Corp\Q3_Report.xlsx
C: = the drive letter (your main hard drive) · Users\JohnSmith = your user profile folder · Documents\Acme_Corp = subfolder path · Q3_Report.xlsx = the file (an Excel spreadsheet)
MAC / LINUX PATH:
/Users/jsmith/Desktop/Presentation.pptx
/ = the root (top of the file system) · Users/jsmith = user profile · Desktop = folder on the desktop · Presentation.pptx = PowerPoint file
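As a rough illustration of the segments described above, here is a minimal Python sketch that splits both example paths using the standard library's `pathlib`. The variable names (`win`, `mac`) are only for this example; the `Pure...Path` classes parse the text of a path without touching the disk, so the sketch should behave the same on any operating system.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Parse the two example paths as text only (no file needs to exist).
win = PureWindowsPath(r"C:\Users\JohnSmith\Documents\Acme_Corp\Q3_Report.xlsx")
mac = PurePosixPath("/Users/jsmith/Desktop/Presentation.pptx")

print(win.drive)   # C:  (the drive letter)
print(win.parts)   # ('C:\\', 'Users', 'JohnSmith', 'Documents', 'Acme_Corp', 'Q3_Report.xlsx')
print(win.name)    # Q3_Report.xlsx  (the file itself)
print(win.suffix)  # .xlsx  (the extension)

print(mac.root)    # /  (the top of the file system)
print(mac.parts)   # ('/', 'Users', 'jsmith', 'Desktop', 'Presentation.pptx')
print(mac.name)    # Presentation.pptx
```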
Did you catch the difference?
The slashes go in opposite directions depending on the operating system. This is one of the smallest details in computing and one of the most consequential. Mix them up and a file path silently breaks.
Backslash (\): Windows
Forward slash (/): Mac / Linux / URLs
Think of the slash as part of the address system. Just as different countries write addresses in different orders (street first in the US, country first in Japan), different operating systems use different separators. The "map" (the OS) determines which slash means "go into this folder." Use the wrong one and the OS doesn't recognize the address.
Web URLs always use forward slashes, regardless of your OS, which is why https://example.com/page works everywhere.
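To make the "wrong separator" problem concrete, the sketch below reads the Windows example path using Mac/Linux rules. It is only an illustration using Python's `pathlib`; the variable names are made up for this example. Under Mac/Linux rules the backslash is an ordinary character, so the whole string is treated as one long file name rather than a chain of folders.

```python
from pathlib import PurePosixPath, PureWindowsPath

windows_text = r"C:\Users\JohnSmith\Documents\Acme_Corp\Q3_Report.xlsx"

# Read with Mac/Linux rules: no folder separators are recognized.
misread = PurePosixPath(windows_text)
print(len(misread.parts))  # 1  (the entire string is one "file name")

# Read with Windows rules: the folders are recognized.
correct = PureWindowsPath(windows_text)
print(len(correct.parts))  # 6

# The same location written with forward slashes.
print(correct.as_posix())  # C:/Users/JohnSmith/Documents/Acme_Corp/Q3_Report.xlsx
```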
Quick check: test yourself
A colleague sends you this file path and says they're on a Mac:
C:\Users\Ana\Documents\report.pdf
What's wrong with this path?
File Extensions
The letters after the final dot in a filename tell the OS what type of data the file contains and which app should open it.
Important: Windows hides file extensions by default. To show them: open File Explorer → View → Show → File name extensions. This helps verify file types, which is crucial for security (a virus might be named invoice.pdf.exe).
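The double-extension trick mentioned above can be checked programmatically. The following is a small heuristic sketch, not a security tool: the function name `looks_disguised` and the list of executable extensions are assumptions chosen for illustration, and the list is deliberately incomplete.

```python
from pathlib import PurePath

# Illustrative, non-exhaustive set of extensions that run code on Windows.
EXECUTABLE_EXTENSIONS = {".exe", ".bat", ".cmd", ".scr", ".vbs"}

def looks_disguised(filename: str) -> bool:
    """Flag names like 'invoice.pdf.exe': two or more extensions, the last one executable."""
    suffixes = [s.lower() for s in PurePath(filename).suffixes]
    return len(suffixes) >= 2 and suffixes[-1] in EXECUTABLE_EXTENSIONS

print(PurePath("invoice.pdf.exe").suffixes)  # ['.pdf', '.exe']
print(looks_disguised("invoice.pdf.exe"))    # True
print(looks_disguised("Q3_Report.xlsx"))     # False
```

Only the last extension decides how the operating system treats the file, which is why seeing the full name matters.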
Under the Hood
Every file, whether it's a spreadsheet, a photo, or an email, is stored as a sequence of binary digits (bits): ones and zeros. A single bit is a 1 or 0. Eight bits make a byte. Your 256 GB hard drive holds roughly 256 billion bytes. But raw bits are meaningless without knowing how to interpret them, and that's where data types and encoding come in.
Think of a sequence of dots and dashes. Without knowing it's Morse code, it's just a pattern. The dots and dashes are the bits; Morse code is the encoding. The word you decode is the data type. Computers work the same way: the same bits can represent a number, a letter, or a pixel color depending on how the software is told to read them.
When you look at a spreadsheet, you see names, dates, prices, and yes/no fields. A computer sees each of these as a specific data type, and the type determines what operations are possible. Getting the type wrong is one of the most common causes of errors when working with data tools and AI.
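The idea that the same bits can mean different things is easy to try out. The sketch below takes four arbitrary bytes (chosen for this example, not taken from any real file) and reads them three ways: as text, as a whole number, and as a color.

```python
# Four bytes (32 bits) chosen purely for illustration.
raw = bytes([0x48, 0x69, 0x21, 0x00])

# The first byte written out as eight bits.
first_byte_bits = format(raw[0], "08b")
print(first_byte_bits)  # 01001000

# Interpretation 1: the first three bytes as ASCII text.
as_text = raw[:3].decode("ascii")
print(as_text)          # Hi!

# Interpretation 2: all four bytes as one whole number (little-endian byte order).
as_number = int.from_bytes(raw, "little")
print(as_number)        # 2189640

# Interpretation 3: the first three bytes as a red/green/blue pixel color.
as_pixel = tuple(raw[:3])
print(as_pixel)         # (72, 105, 33)
```

Nothing in the bytes themselves says which reading is correct; that information comes from the file format and the software opening it.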
| Data Type | What It Stores | Examples | Watch Out For |
|---|---|---|---|
| Integer | Whole numbers (no decimals) | 42, -7, 0, 1000000 | A zip code like 02134 may lose its leading zero if stored as an integer |
| Float / Decimal | Numbers with decimal points | 3.14, -0.5, 99.99 | Floating-point math can produce tiny rounding errors (0.1 + 0.2 = 0.30000000000000004) |
| String (Text) | Any sequence of characters | "Hello", "97301", "N/A" | The number "42" stored as text can't be added to another number without conversion |
| Boolean | True or False (yes/no, on/off) | TRUE, FALSE | Some systems encode True as 1 and False as 0; others use "Yes"/"No". Mixing them causes errors |
| Date / DateTime | Calendar dates and times | 2024-03-15, 2024-03-15T14:30:00 | Dates are notoriously messy: "03/04/2024" means March 4 in the US but April 3 in Europe |
| Categorical | One value from a fixed set | "Red", "Blue", "Green" or "Small", "Medium", "Large" | Typos create phantom categories: "Smal" and "Small" look like two different values |
When you upload a CSV to an AI tool or a data pipeline, the system has to infer what type each column is. If your "Revenue" column contains the text entry "$1,200" instead of the number 1200, the system may treat the entire column as text, and every calculation breaks silently. AI tools are powerful but not magic: they work with what you give them. Understanding data types helps you catch these problems before the AI does something confidently wrong with bad input.
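A few of the pitfalls from the table, plus the type-inference problem just described, can be reproduced in a handful of lines. This is a simplified sketch: `infer_column_type` is a hypothetical helper written for this example, and real tools use more elaborate rules, but the failure mode is similar in spirit.

```python
from decimal import Decimal

# Integer pitfall: a zip code stored as a number loses its leading zero.
zip_as_number = int("02134")
print(zip_as_number)  # 2134

# Float pitfall: tiny rounding errors in ordinary decimal arithmetic.
float_sum = 0.1 + 0.2
print(float_sum)      # 0.30000000000000004
decimal_sum = Decimal("0.1") + Decimal("0.2")
print(decimal_sum)    # 0.3

def infer_column_type(values):
    """Naive inference: the column is numeric only if EVERY value parses as a number."""
    for caster, label in ((int, "integer"), (float, "float")):
        try:
            for value in values:
                caster(value)
            return label
        except ValueError:
            continue
    return "text"

print(infer_column_type(["1200", "950"]))     # integer
print(infer_column_type(["1200.50", "950"]))  # float
print(infer_column_type(["$1,200", "950"]))   # text (one formatted value spoils the column)
```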
Text is stored using an encoding system that maps each character to a number. The modern standard is UTF-8, which handles virtually every character in every language (plus emoji). Older systems used ASCII (English only, 128 characters) or regional encodings like Latin-1 (Western European). When you open a file and see garbled characters like Ã© instead of é, it means the file was saved in one encoding but opened in another. When working with AI tools, stick to UTF-8: it's what almost everything expects.
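The garbling described above can be reproduced on purpose. The sketch below saves one character as UTF-8 and then reads the same bytes as Latin-1; the variable names are only for illustration.

```python
original = "é"

# Saved as UTF-8: one character becomes two bytes.
utf8_bytes = original.encode("utf-8")
print(utf8_bytes)   # b'\xc3\xa9'

# Opened as Latin-1: each byte is shown as its own character.
misread = utf8_bytes.decode("latin-1")
print(misread)      # Ã©

# Reversing the mistake recovers the original text.
repaired = misread.encode("latin-1").decode("utf-8")
print(repaired)     # é
```

When reading or writing text files in Python, passing `encoding="utf-8"` to `open()` makes the choice explicit instead of relying on the system default.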
Where Files Live
Local storage: Files saved directly to your hard drive or SSD. Fast access, works without internet. Located in C:\Users\YourName\ (Windows) or /Users/yourname/ (Mac). Risk: if your computer fails without backup, files can be lost.
Network drive: Files on a central company server. Appears as a drive letter like Z: or \\server\share. Multiple people can access and edit. Requires company network or VPN. IT manages backups.
Cloud storage: Files on remote servers accessed via internet: OneDrive, Google Drive, Dropbox, SharePoint. Files sync across all devices. Accessible anywhere. A folder on your computer syncs automatically to the cloud.
Data Formats for AI
Not all data is the same shape. Understanding the difference between structured and unstructured data is essential for knowing what AI tools can do with your files, and which file format to use.
Structured data: Data organized into rows and columns with consistent types, like a spreadsheet or database table. Each column has a name and a data type; each row is a record.
Formats: .csv, .xlsx, .tsv, .json (tabular), database tables
Examples: sales records, employee rosters, survey responses, financial reports
AI use: data analysis, trend detection, forecasting, dashboards
Unstructured data: Data without a predefined row/column format, such as text documents, emails, images, audio, and video. Most of the world's data is unstructured.
Formats: .pdf, .docx, .txt, .jpg, .mp4, .html, email threads
Examples: customer emails, meeting transcripts, contracts, photos
AI use: summarization, extraction, classification, image analysis
Different AI tools expect different inputs. When you paste a table of data into Claude as plain text, it works, but when you upload the same data as a well-formatted .csv file, the AI can parse the columns, understand the types, and give you better analysis. Knowing which format to use is the difference between a vague answer and a precise one.
.csv (Comma-Separated Values): The universal format for tabular data. Each line is a row; values are separated by commas. Nearly every tool can read it: Excel, Google Sheets, Python, R, AI platforms. When in doubt about how to share data with an AI tool, CSV is almost always the right answer.
.json (JavaScript Object Notation): Structured data in key-value pairs. Used heavily by APIs and web services. More flexible than CSV because it can represent nested data (a customer who has multiple orders, each with multiple items). AI APIs (like OpenAI's) send and receive JSON.
.txt (Plain Text): No formatting, no structure, just characters. Useful for unstructured data like meeting notes, email bodies, or raw text you want an AI to summarize or classify. The simplest format, readable by everything.
.pdf (Portable Document Format): Preserves visual layout, but data inside is often hard for AI to parse. A PDF of a table looks clean to humans but may be a jumbled mess to an AI that has to extract the text. When possible, give the AI the underlying data (CSV, XLSX) rather than the PDF report generated from it.
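To show how the two structured formats differ in shape, here is a short sketch that reads a tiny CSV and a tiny JSON document with Python's standard library. The sample records (order numbers, customer names, items) are invented for this example.

```python
import csv
import io
import json

# CSV: flat rows and columns. Every value arrives as text until converted.
csv_text = "order_id,customer,total\n1001,Acme Corp,1200\n1002,Globex,950\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["customer"])     # Acme Corp
print(type(rows[0]["total"]))  # <class 'str'>

# JSON: nested data. One customer holds a list of orders, each with a list of items.
json_text = """
{
  "customer": "Acme Corp",
  "orders": [
    {"order_id": 1001, "items": ["Widget", "Gadget"]},
    {"order_id": 1002, "items": ["Widget"]}
  ]
}
"""
customer = json.loads(json_text)
print(customer["orders"][0]["items"])           # ['Widget', 'Gadget']
print(type(customer["orders"][0]["order_id"]))  # <class 'int'>
```

Note the difference in types: CSV carries no type information, so the reader hands back strings, while JSON distinguishes numbers from text in the file itself.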
Data Preparation
AI tools are powerful, but they are not tolerant of messy input. If you feed a model a spreadsheet full of inconsistent dates, misspelled categories, and columns where numbers are stored as text, the output will reflect that mess, confidently and without warning. Data sanitization is the practice of cleaning and standardizing your data before giving it to any tool.
You wouldn't hand a colleague a report with pages in random order, some paragraphs in French, and half the numbers written in Roman numerals, then ask for a summary. But that's what feeding messy data to an AI tool is. The model will try its best, but the result will be unreliable in ways you can't easily detect.
| Problem | Example | Fix |
|---|---|---|
| Inconsistent text | "Sales", "sales", "SALES", "Slaes" | Standardize case; use spell-check; search for near-duplicates |
| Mixed date formats | "3/4/2024", "2024-03-04", "March 4, 2024" | Pick one format (ISO 8601: YYYY-MM-DD is best) and apply to the whole column |
| Numbers as text | "$1,200" or "1 200" instead of 1200 | Remove currency symbols, commas, and spaces; ensure column is formatted as a number |
| Missing values | Empty cells, "N/A", "n/a", "-", "unknown" | Decide on one representation (blank or "NA"); be consistent; document what missing means |
| Leading/trailing spaces | " Marketing" vs "Marketing" | Use TRIM() in Excel/Sheets; these invisible differences cause failed lookups |
| Merged cells | A header spanning three columns in Excel | Unmerge all cells before exporting. Merged cells break CSV export and AI parsing |
| Multiple data points in one cell | "Smith, Jones, Lee" in a single "Assignees" cell | Split into separate rows or separate columns; one value per cell is the rule |
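Several of the fixes in the table above can be automated. The helpers below are a minimal sketch, not a complete cleaning pipeline: the function names, the list of "missing" markers, and the accepted date formats are all assumptions made for this example. In particular, `to_iso_date` assumes that slash-separated dates are in US order (month first), which you would need to confirm for your own data.

```python
from datetime import datetime

# Assumed set of markers that all mean "no value" in this example.
MISSING_MARKERS = {"", "n/a", "na", "-", "unknown"}

def clean_category(value: str) -> str:
    """Trim spaces, collapse repeated spaces, and standardize case."""
    return " ".join(value.split()).title()

def clean_number(value: str) -> float:
    """Remove currency symbols, commas, and spaces, then convert to a number."""
    return float(value.replace("$", "").replace(",", "").replace(" ", ""))

def clean_missing(value: str):
    """Map every 'missing' marker to a single representation (None)."""
    stripped = value.strip()
    return None if stripped.lower() in MISSING_MARKERS else stripped

def to_iso_date(value: str) -> str:
    """Convert a few known formats to YYYY-MM-DD. Assumes US month-first order."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y"):
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")

def split_cell(value: str) -> list:
    """Split a comma-separated cell into one value per entry."""
    return [part.strip() for part in value.split(",") if part.strip()]

print(clean_category("  SALES "))       # Sales
print(clean_number("$1,200"))           # 1200.0
print(clean_missing("N/A"))             # None
print(to_iso_date("March 4, 2024"))     # 2024-03-04
print(split_cell("Smith, Jones, Lee"))  # ['Smith', 'Jones', 'Lee']
```

Standardizing case does not fix misspellings such as "Slaes"; those still need a spell-check or a manual review of the distinct values in the column.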
One clear header row: column names in row 1, data starting in row 2, no blank rows above
Consistent data types per column: don't mix numbers and text in the same column
One value per cell: no merged cells, no comma-separated lists inside a single cell
Missing values handled consistently: pick one representation and use it everywhere
Dates in ISO format: YYYY-MM-DD (2024-03-15) sorts correctly and is universally readable
No formatting artifacts: remove currency symbols, percentage signs, and commas from numbers
Saved as .csv (UTF-8): the safest, most portable format for AI tools
When you ask an AI to "find the average salary by department" and your data has "Marketing" in some rows and "marketing " (with a trailing space) in others, the AI will treat them as two separate departments and give you two separate averages, both wrong. When your date column has three different formats, the model may misparse March 4 as April 3. AI tools don't complain about messy data. They just give you confidently wrong answers. Cleaning your data before uploading is the single most impactful thing you can do to get useful output from any AI tool.
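The department example can be worked through with a few made-up rows. In the sketch below the salary figures are invented, and `average_by_department` is a hypothetical helper; it computes the averages twice, once on the raw labels and once after trimming spaces and standardizing case.

```python
from collections import defaultdict
from statistics import mean

# Invented sample data: note the lowercase label with a trailing space.
rows = [
    ("Marketing", 60000),
    ("marketing ", 50000),
    ("Marketing", 70000),
    ("Sales", 55000),
]

def average_by_department(rows, normalize=False):
    """Group salaries by department label and average each group."""
    groups = defaultdict(list)
    for department, salary in rows:
        key = department.strip().title() if normalize else department
        groups[key].append(salary)
    return {department: mean(salaries) for department, salaries in groups.items()}

raw_averages = average_by_department(rows)
clean_averages = average_by_department(rows, normalize=True)

print(len(raw_averages))            # 3  ("Marketing" and "marketing " counted separately)
print(raw_averages["Marketing"])    # 65000
print(raw_averages["marketing "])   # 50000
print(clean_averages["Marketing"])  # 60000
```

Neither of the two raw Marketing figures matches the true average of 60000, and nothing in the raw output signals that a problem occurred.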
Practical Skills
Windows: Search box in File Explorer (top right). Press Win+S to search the entire computer. Filter by date modified, file type, or size. Check Recent items in most apps (File → Recent).
Mac: Use Spotlight (Cmd+Space) and type the filename. In Finder, press Cmd+F to search within a specific folder.
Accidentally deleted? Check the Recycle Bin (Windows) or Trash (Mac); files go there first. Right-click → Restore.
Use dates in filenames for version tracking: 2024-03-15_Budget_Final.xlsx (YYYY-MM-DD sorts chronologically). Avoid spaces โ use underscores or hyphens. Avoid special characters (/ \ : * ? " < > |). Be specific: Smith_Contract_Signed.pdf not Document1.pdf.
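The naming rules above can be captured in a small helper. This is only a sketch: `make_filename` is a hypothetical function written for this example, and it handles just the rules listed here (date prefix, no spaces, no special characters).

```python
import re
from datetime import date

# The special characters to avoid in filenames: / \ : * ? " < > |
FORBIDDEN = r'[/\\:*?"<>|]'

def make_filename(description: str, extension: str, day: date) -> str:
    """Build a name like 2024-03-15_Budget_Final.xlsx from its parts."""
    name = re.sub(FORBIDDEN, "", description)   # drop special characters
    name = re.sub(r"\s+", "_", name.strip())    # replace spaces with underscores
    return f"{day.isoformat()}_{name}.{extension.lstrip('.')}"

print(make_filename("Budget Final", "xlsx", date(2024, 3, 15)))
# 2024-03-15_Budget_Final.xlsx

# YYYY-MM-DD prefixes sort chronologically with a plain alphabetical sort.
names = ["2024-03-15_Budget.xlsx", "2023-12-01_Budget.xlsx", "2024-01-20_Budget.xlsx"]
sorted_names = sorted(names)
print(sorted_names[0])  # 2023-12-01_Budget.xlsx
```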
Naming-based versioning (Report_v2_FINAL_JSmithEdits.docx) gets unwieldy fast. Better: use cloud tools like Google Docs or SharePoint that have built-in version history, where you can see every change and restore any previous version. Check: File → Version History.
The 3-2-1 backup rule: keep 3 copies of important data, on 2 different types of storage, with 1 copy offsite (or in cloud). For most workers: files on your computer (1), synced to cloud storage like OneDrive (2), and optionally on a USB drive (3). Verify with IT what's actually backed up.
Workplace Scenarios
Systematic approach: (1) Check the app's Recent Files (File → Recent). (2) Search by filename in File Explorer or Spotlight. (3) Search by date, since you know you saved it yesterday. (4) Check cloud folders (OneDrive, Google Drive). (5) Check the Downloads folder, since files sometimes default there.
Convert to PDF: In Word, File → Save As (or Export) → choose PDF. Or File → Print → "Microsoft Print to PDF." On Mac: File → Print → PDF → Save as PDF. PDFs preserve formatting and can't be accidentally edited by the recipient.
Options: (1) Compress the folder: right-click → "Compress" (Mac) or "Send to → Compressed zip folder" (Windows). (2) Share via cloud link: upload to OneDrive or Google Drive, then share the link. Better for large files and keeps them editable. (3) Use WeTransfer for very large files.
Check Your Understanding
Answer all questions to complete this module.
1. In the path C:\Users\Ana\Documents\report.pdf, what is "report.pdf"?
2. Why should you turn on file extensions in Windows?
3. What's the best way to share a 500MB video file with a client?
4. A spreadsheet column called "Revenue" contains values like "$1,200" and "$950". You want to upload this to an AI tool for analysis. What's the problem?
5. You have customer feedback emails (unstructured) and a sales spreadsheet (structured). Which format should you use to upload the sales data to an AI for trend analysis?
6. Your "Department" column has these values: "Marketing", "marketing", "Marketing ", "Mktg". You ask an AI for average salary by department. What will happen?