Pricing
from $1.00 / 1,000 results
Mubawab.ma Housing Scraper
Scrapes Moroccan real estate listings from mubawab.ma and outputs a structured dataset ready for ML model training (price prediction, classification).
The Moroccan Housing Dataset: an open-source Apify Actor that scrapes real estate listings from mubawab.ma and produces a flat, ML-ready dataset modelled after the classic California Housing dataset (Géron, Hands-On ML, Chapter 2).
Table of Contents
- What it does
- Output dataset
- Quick start
- Input configuration
- Apify Console output
- ML usage example
- Architecture
- Cities & property types covered
- Contributing
- License
What it does
Morocco's real estate market lacks structured, machine-readable public data. This actor closes that gap by crawling mubawab.ma, Morocco's largest property portal, and extracting every listing into a single CSV/JSON dataset suitable for:
- Price prediction models (regression)
- Geo-spatial analysis by city and neighborhood
- Market trend dashboards
- AI / LLM-powered property assistants
The scraper uses a two-phase Playwright crawl (search results → detail pages) and persists output through the Apify storage API, so you can export CSV/JSON directly from the platform or via the API with no extra tooling.
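As a rough illustration of the "via API" route, the sketch below builds an export URL against Apify's dataset-items endpoint. The helper name `dataset_export_url` and the placeholder dataset ID are made up for this example, and only the `json` and `csv` formats mentioned in this README are allowed.

```python
from typing import Optional
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"


def dataset_export_url(dataset_id: str, fmt: str = "csv", token: Optional[str] = None) -> str:
    """Build the URL that returns a dataset's items in the requested format."""
    if fmt not in {"json", "csv"}:
        raise ValueError(f"unsupported format: {fmt}")
    params = {"format": fmt}
    if token:
        # Private datasets need an API token; public ones can omit it.
        params["token"] = token
    return f"{API_BASE}/datasets/{dataset_id}/items?{urlencode(params)}"


# "YOUR_DATASET_ID" is a placeholder, not a real dataset.
print(dataset_export_url("YOUR_DATASET_ID"))
```

The resulting URL can be passed to any HTTP client, or directly to `pandas.read_csv`.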
Output dataset
Every scraped listing maps to one row with these fields:
| Field | Type | Description |
|---|---|---|
| `priceDh` | number \| null | Target variable: price in Moroccan Dirhams (MAD) |
| `pricePerM2` | number \| null | Derived: price ÷ surface (MAD/m²) |
| `surfaceM2` | number \| null | Living area in m² |
| `numRooms` | integer \| null | Bedrooms |
| `numBathrooms` | integer \| null | Bathrooms |
| `floor` | integer \| null | Floor (0 = ground / RDC) |
| `propertyType` | string \| null | appartement, villa, maison, riad, … |
| `standing` | string \| null | economique, moyen_standing, haut_standing |
| `state` | string \| null | neuf, bon_etat, a_renover, en_cours_de_construction |
| `city` | string \| null | Lowercase ASCII name, e.g. casablanca |
| `neighborhood` | string \| null | Sub-area within the city |
| `transactionType` | string \| null | vente or location |
| `url` | string | Direct link to the listing on mubawab.ma |
| `title` | string \| null | Raw listing title |
| `scrapedAt` | string | ISO-8601 scrape timestamp |
Sample record
```json
{
  "priceDh": 1250000,
  "pricePerM2": 12500,
  "surfaceM2": 100,
  "numRooms": 3,
  "numBathrooms": 2,
  "floor": 3,
  "propertyType": "appartement",
  "standing": "moyen_standing",
  "state": "bon_etat",
  "city": "casablanca",
  "neighborhood": "maârif",
  "transactionType": "vente",
  "url": "https://www.mubawab.ma/fr/a/12345/appartement-a-vendre-casablanca",
  "title": "Appartement à vendre à Maârif, Casablanca",
  "scrapedAt": "2025-03-27T14:32:00.000Z"
}
```
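Since `pricePerM2` is derived from `priceDh` and `surfaceM2`, it can be re-computed on the consumer side as a sanity check. The sketch below is not the actor's own code: the helper name and the rounding to whole dirhams are assumptions.

```python
from typing import Optional


def price_per_m2(price_dh: Optional[float], surface_m2: Optional[float]) -> Optional[int]:
    """Return price ÷ surface, or None if an input is missing or the surface is not positive."""
    if price_dh is None or surface_m2 is None or surface_m2 <= 0:
        return None
    # Rounding to whole dirhams is an assumption made for this example.
    return round(price_dh / surface_m2)


# Values taken from the sample record.
record = {"priceDh": 1250000, "surfaceM2": 100, "pricePerM2": 12500}
assert price_per_m2(record["priceDh"], record["surfaceM2"]) == record["pricePerM2"]
```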
Quick start
Option A: Run on Apify (no setup needed)
- Open the actor on the Apify Store
- Click Try for free
- Configure inputs in the visual form
- Click Start, then export results as CSV or JSON once the run completes
Option B: Run locally
Prerequisites: Node.js 20+, Apify CLI
```bash
# 1. Install the CLI
npm install -g apify-cli

# 2. Clone this repo
git clone https://github.com/MuLIAICHI/Mubawab-Housing-Scraper.git
cd Mubawab-Housing-Scraper

# 3. Install dependencies
npm install

# 4. Quick test: 10 listings only
apify run --input='{"maxListings": 10, "transactionType": "vente"}'

# 5. Full run: all 9 cities, up to 5 000 listings
apify run
```
Results are saved locally under storage/datasets/mubawab-housing/.
Option C: Deploy to your Apify account
```bash
apify login   # Enter your Apify API token
apify push    # Build & upload the actor
```
Then run and schedule from console.apify.com.
Input configuration
Configure the actor via the Apify Console form or by passing a JSON input:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `transactionType` | string | `"vente"` | `"vente"` · `"location"` · `"both"` |
| `cities` | string[] | (all 9 cities) | Filter to specific cities, e.g. `["casablanca", "rabat"]` |
| `propertyTypes` | string[] | 4 main types | appartements · villas · maisons · riads · terrains · bureaux · commerces |
| `maxListings` | integer | 5000 | Hard cap on detail pages scraped (0 = unlimited) |
| `maxConcurrency` | integer | 5 | Parallel browser tabs (max 20) |
| `startUrls` | array | `[]` | Override seed URLs; leave empty for auto-generation |
| `proxyConfiguration` | object | Apify Residential | Proxy settings; residential proxy is strongly recommended |
Example input
```json
{
  "transactionType": "vente",
  "cities": ["casablanca", "marrakech", "rabat"],
  "propertyTypes": ["appartements", "villas"],
  "maxListings": 1000,
  "maxConcurrency": 5,
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}
```
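When building inputs programmatically, it can help to check them against the parameter table before starting a run. The following is a minimal client-side sketch, not the actor's own validation: `check_input` is a hypothetical helper, and the lower bound of 1 for `maxConcurrency` is an assumption (the table only states the maximum).

```python
ALLOWED_TRANSACTIONS = {"vente", "location", "both"}
ALLOWED_PROPERTY_TYPES = {
    "appartements", "villas", "maisons", "riads",
    "terrains", "bureaux", "commerces",
}
# Defaults copied from the parameter table. cities and propertyTypes are left
# out because their default lists are not spelled out as input values there.
DEFAULTS = {"transactionType": "vente", "maxListings": 5000, "maxConcurrency": 5, "startUrls": []}


def check_input(user_input: dict) -> dict:
    """Merge user input over the defaults and raise ValueError on bad values."""
    merged = {**DEFAULTS, **user_input}
    problems = []
    if merged["transactionType"] not in ALLOWED_TRANSACTIONS:
        problems.append(f"unknown transactionType: {merged['transactionType']!r}")
    if not isinstance(merged["maxListings"], int) or merged["maxListings"] < 0:
        problems.append("maxListings must be an integer >= 0 (0 = unlimited)")
    # Assumption: at least one tab; the table only documents the cap of 20.
    if not isinstance(merged["maxConcurrency"], int) or not 1 <= merged["maxConcurrency"] <= 20:
        problems.append("maxConcurrency must be between 1 and 20")
    unknown = set(merged.get("propertyTypes", [])) - ALLOWED_PROPERTY_TYPES
    if unknown:
        problems.append(f"unknown propertyTypes: {sorted(unknown)}")
    if problems:
        raise ValueError("; ".join(problems))
    return merged


config = check_input({"cities": ["casablanca"], "propertyTypes": ["villas"], "maxListings": 1000})
print(config["maxConcurrency"])  # falls back to the default of 5
```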
Apify Console output
After a run completes, the Output tab in Apify Console shows four named links:
| Output | Description |
|---|---|
| Housing listings (Overview) | All scraped records in a table view (city, type, price, surface, rooms, URL) |
| ML-ready dataset | Same records restricted to the 12 ML feature columns; export this as CSV for model training |
| Run statistics | JSON with total listings, pages visited, null-rates per field, elapsed time |
| Debug HTML snapshots | HTML captured when a page could not be parsed; useful for debugging after site updates |
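The per-field null-rates reported in the run statistics can also be reproduced from an exported dataset. The sketch below shows one plausible definition (share of records where the field is missing or null); how the actor itself computes the figure is not documented here, so treat the helper as an assumption.

```python
from typing import Dict, Iterable, List


def null_rates(records: List[dict], fields: Iterable[str]) -> Dict[str, float]:
    """Return, per field, the fraction of records where the value is missing or None."""
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for record in records if record.get(field) is None) / total
        for field in fields
    }


# Made-up records for illustration only.
sample = [
    {"priceDh": 900000, "standing": None},
    {"priceDh": None, "standing": None},
]
print(null_rates(sample, ["priceDh", "standing"]))
```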
ML usage example (Python)
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error

# 1. Load dataset exported from Apify as CSV (ML Dataset view)
df = pd.read_csv("mubawab_dataset.csv")

# 2. Drop rows missing the target variable or the surface
df = df.dropna(subset=["priceDh", "surfaceM2"])

# 3. Encode categoricals
df = pd.get_dummies(df, columns=["propertyType", "standing", "state", "city", "transactionType"])

# 4. Feature engineering: Géron-style derived features
df["roomsPerM2"] = df["numRooms"] / df["surfaceM2"]
excluded = ["priceDh", "pricePerM2", "neighborhood", "url", "title", "scrapedAt"]
feature_cols = [c for c in df.columns if c not in excluded]
X = df[feature_cols].fillna(0)
y = df["priceDh"]

# 5. Train & evaluate
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"R²  : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : {mean_absolute_error(y_test, y_pred):,.0f} MAD")
```
Architecture
```
.
├── .actor/
│   ├── actor.json                  ← Actor metadata + schema references
│   ├── input_schema.json           ← Typed input form for Apify Console
│   ├── output_schema.json          ← Output tab links (dataset + KV store)
│   ├── dataset_schema.json         ← Field definitions + two table views
│   └── key_value_store_schema.json ← KV store collections (stats / snapshots)
│
├── src/
│   ├── main.js                     ← Entry point: reads input, seeds URLs, starts crawler
│   ├── router.js                   ← Crawlee router with LISTING_PAGE + DETAIL_PAGE labels
│   ├── parsers/
│   │   ├── listingPage.js          ← Extracts listing URLs + next-page link from search results
│   │   └── detailPage.js           ← Extracts all 15 schema fields from a property detail page
│   └── utils/
│       └── normalize.js            ← Pure functions: parsePrice(), parseSurface(), normalizeCity()
│
├── Dockerfile                      ← Apify Playwright image (Node.js 20 + Chromium)
├── package.json
└── README.md
```
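The helpers in `normalize.js` are described as pure functions. As a rough illustration of what such normalizers might do, here are Python analogues. They are not a port of the actual JavaScript: the input formats (`"1 250 000 DH"`, `"100 m²"`) and the parsing rules are assumptions, and `parse_price` ignores decimal parts.

```python
import re
import unicodedata
from typing import Optional


def parse_price(text: Optional[str]) -> Optional[int]:
    """Keep only the digits of a price string such as '1 250 000 DH'."""
    digits = re.sub(r"\D", "", text or "")
    return int(digits) if digits else None


def parse_surface(text: Optional[str]) -> Optional[float]:
    """Read the first number in a surface string such as '100 m²'."""
    match = re.search(r"\d+(?:[.,]\d+)?", text or "")
    return float(match.group().replace(",", ".")) if match else None


def normalize_city(name: str) -> str:
    """Lowercase a city name and strip accents so it becomes plain ASCII."""
    decomposed = unicodedata.normalize("NFKD", name.strip().lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(parse_price("1 250 000 DH"), parse_surface("100 m²"), normalize_city("Fès"))
```

Keeping these functions free of I/O makes them straightforward to unit test, which matches the "pure functions" note in the tree above.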
Crawl flow
```
main.js ──builds seed URLs──► LISTING_PAGE handler
                                      │
                       ┌──────────────▼────────────────┐
                       │ Parse search result page      │
                       │ Extract listing URLs          │
                       │ Follow rel="next" pagination  │
                       └──────────────┬────────────────┘
                                      │ enqueue detail URLs
                       ┌──────────────▼────────────────┐
                       │ DETAIL_PAGE handler           │
                       │ detailPage.js extracts fields │
                       │ normalize.js cleans values    │
                       │ Actor.pushData() → dataset    │
                       └───────────────────────────────┘
```
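To make the two-phase flow concrete, here is a small in-memory simulation in Python. It is not Crawlee code and does not touch the network: the page dictionaries, URLs, and the `crawl` function are all invented for illustration, and only the labels `LISTING_PAGE` and `DETAIL_PAGE` and the `maxListings` semantics (0 = unlimited) come from this README.

```python
from collections import deque

# Fake site: each search page lists detail URLs and an optional next page.
SEARCH_PAGES = {
    "/search?page=1": {"listings": ["/a/1", "/a/2"], "next": "/search?page=2"},
    "/search?page=2": {"listings": ["/a/3"], "next": None},
}
DETAIL_PAGES = {
    "/a/1": {"priceDh": 900000},
    "/a/2": {"priceDh": None},
    "/a/3": {"priceDh": 1250000},
}


def crawl(seed_urls, max_listings=0):
    """Process a labelled request queue; max_listings=0 means no cap."""
    queue = deque(("LISTING_PAGE", url) for url in seed_urls)
    seen = set(seed_urls)
    dataset = []
    while queue:
        label, url = queue.popleft()
        if label == "LISTING_PAGE":
            page = SEARCH_PAGES[url]
            requests = [("DETAIL_PAGE", u) for u in page["listings"]]
            if page["next"]:
                requests.append(("LISTING_PAGE", page["next"]))
            for request in requests:
                if request[1] not in seen:  # de-duplicate by URL
                    seen.add(request[1])
                    queue.append(request)
        elif not max_listings or len(dataset) < max_listings:
            # Stand-in for "extract fields, normalize, push to dataset".
            dataset.append({"url": url, **DETAIL_PAGES[url]})
    return dataset


print([row["url"] for row in crawl(["/search?page=1"])])
```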
Key technical decisions
- Playwright (not Cheerio): mubawab.ma is JS-rendered, so a headless browser is required
- Multiple CSS selector fallbacks: the site uses different HTML structures for individual listings vs. project/ensemble listings
- Polite delays: 500–800 ms between requests to avoid rate-limiting
- Named dataset `mubawab-housing`: makes the output easy to find and retrieve via API
Cities & property types covered
Cities (default): Casablanca · Marrakech · Rabat · Agadir · Tanger · Fès · Meknès · Oujda · Tétouan
Property types: Appartements · Villas · Maisons · Riads · Terrains · Bureaux · Commerces
Pass any subset via the cities and propertyTypes input fields.
Proxy recommendation
mubawab.ma blocks datacenter IPs. Using Apify Residential Proxy (the default) is strongly recommended for production runs. A free Apify account includes a proxy trial.
Without a proxy, you will encounter CAPTCHAs and 403 errors.
Contributing
Contributions are welcome! Here is how to get started:
- Fork this repository
- Create a feature branch: `git checkout -b feat/your-feature`
- Make your changes and run a quick local test: `apify run --input='{"maxListings": 5}'`
- Open a Pull Request with a clear description of what changed and why
Good first issues
- Add support for additional Moroccan cities (`agadir`, `beni-mellal`, `laayoune`, …)
- Improve null-rate for `standing` and `state` fields on project listings
- Add `listing_id` extraction from the URL slug
- Write unit tests for `normalize.js` (Jest or Vitest)
Please open an issue before starting large changes.
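For anyone who needs a listing ID before it is added to the actor, it can be derived from the exported `url` column as a post-processing step. The sketch below assumes the `/a/<digits>/<slug>` pattern seen in the sample record; real listing URLs may follow other patterns, and `extract_listing_id` is a hypothetical helper.

```python
import re
from typing import Optional

# Assumed pattern: ".../a/<numeric id>/<slug>", as in the sample record's URL.
LISTING_ID_RE = re.compile(r"/a/(\d+)(?:/|$)")


def extract_listing_id(url: str) -> Optional[int]:
    """Return the numeric ID found after '/a/' in a listing URL, or None."""
    match = LISTING_ID_RE.search(url)
    return int(match.group(1)) if match else None


print(extract_listing_id("https://www.mubawab.ma/fr/a/12345/appartement-a-vendre-casablanca"))
```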
License
LICENSE © 2025 Mustapha LIAICHI
Built with Crawlee · Playwright · Apify SDK
