Subsets of IMDb data are available to customers for personal and non-commercial use. You can hold local copies of this data, subject to our terms and conditions. Please refer to the Non-Commercial Licensing and copyright/license pages and verify compliance.
Data Location and Access Requirements
Files are located in the AWS S3 bucket named imdb-datasets and can be accessed programmatically. The data is refreshed daily. This is a Requester-Pays S3 bucket, and the requester accessing data from this bucket is responsible for the data transfer and request costs. For details on the charges, please refer to https://aws.amazon.com/s3/pricing/.
To access the IMDb data files, you need:
- An Amazon AWS account. You can use an existing account or create one following the instructions here – http://docs.aws.amazon.com/AmazonS3/latest/gsg/SigningUpforS3.html
- A programmatic way to access S3 using the REST API or the AWS SDK wrapper libraries
- S3 Bucket name: imdb-datasets
- Files are located under documents/v1/*
- Data files are stored in a folder with the creation date (in YYYY-MM-DD format) as the label. The latest data files are also stored in a folder labelled “current”. All data files are compressed with gzip.
Please refer to the AWS documentation for how to make requests to S3 - http://docs.aws.amazon.com/AmazonS3/latest/dev/MakingRequests.html. Requests to the imdb-datasets bucket must be authenticated and must include the appropriate request-payer parameter.
Here are two sample Java programs to access IMDb datasets in S3: "Copy one S3 bucket to another" and "Download the S3 object to a local file".
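The access requirements above can also be sketched in Python. This is a minimal illustration, not an official sample: it assumes the third-party boto3 SDK is installed, that AWS credentials are already configured, and that object keys follow the layout documents/v1/<folder>/<filename>. The helper names (object_key, request_args, download) and the example date are hypothetical.

```python
import re
from datetime import date

BUCKET = "imdb-datasets"   # bucket name given in the text
PREFIX = "documents/v1"    # files are located under documents/v1/*


def object_key(filename, folder="current"):
    """Build the S3 object key for one data file.

    `folder` is either "current" or a creation date in YYYY-MM-DD format.
    Assumption: the key layout is <prefix>/<folder>/<filename>.
    """
    if folder != "current":
        if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", folder):
            raise ValueError(f"folder must be 'current' or YYYY-MM-DD, got {folder!r}")
        date.fromisoformat(folder)  # also rejects impossible dates such as 2017-13-40
    return f"{PREFIX}/{folder}/{filename}"


def request_args(filename, folder="current"):
    """Return the parameters that identify the object and accept the charges."""
    return {
        "Bucket": BUCKET,
        "Key": object_key(filename, folder),
        "RequestPayer": "requester",  # required for a Requester-Pays bucket
    }


def download(filename, dest_path, folder="current"):
    """Download one gzipped data file to a local path (incurs AWS charges)."""
    import boto3  # imported lazily so the helpers above work without the SDK

    args = request_args(filename, folder)
    s3 = boto3.client("s3")  # credentials come from the standard AWS configuration
    s3.download_file(
        args["Bucket"],
        args["Key"],
        dest_path,
        ExtraArgs={"RequestPayer": args["RequestPayer"]},
    )
```

For example, download("title.basics.tsv.gz", "title.basics.tsv.gz") would fetch the latest copy of that file, while passing folder="2017-01-05" (an illustrative date) would request the copy stored under that date, if one exists.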
IMDb Dataset Details
Each date-based folder in the S3 bucket contains datasets in gzipped, tab-separated-values (TSV) format. The first line in each file contains headers that describe what is in each column. A ‘\N’ is used to denote that a particular field is missing or null for that title/name. The available datasets (S3 object keys) are as follows:
title.basics.tsv.gz - Contains the following information for titles:
- tconst (string) - alphanumeric unique identifier of the title
- titleType (string) – the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc.)
- primaryTitle (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release
- originalTitle (string) - original title, in the original language
- isAdult (boolean) - 0: non-adult title; 1: adult title.
- startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year.
- endYear (YYYY) – TV Series end year. ‘\N’ for all other title types
- runtimeMinutes – primary runtime of the title, in minutes
- genres (string array) – includes up to three genres associated with the title
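The title.basics.tsv.gz fields listed above could be converted into typed values along the following lines. This is a sketch under stated assumptions: the columns are taken to appear in the order listed, the genres array is assumed to be comma-separated within its column (the text does not name the delimiter), and the TitleBasics class and parse_title_basics function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional

NULL = r"\N"  # marker the datasets use for a missing or null field


@dataclass
class TitleBasics:
    tconst: str
    title_type: str
    primary_title: str
    original_title: str
    is_adult: bool
    start_year: Optional[int]
    end_year: Optional[int]
    runtime_minutes: Optional[int]
    genres: List[str]


def _optional_int(value):
    """Convert a numeric column to int, mapping the null marker to None."""
    return None if value == NULL else int(value)


def parse_title_basics(line):
    """Parse one data line (not the header line) of title.basics.

    Assumptions: nine tab-separated columns in the documented order,
    and genres separated by commas.
    """
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    (tconst, title_type, primary, original,
     is_adult, start, end, runtime, genres) = fields
    return TitleBasics(
        tconst=tconst,
        title_type=title_type,
        primary_title=primary,
        original_title=original,
        is_adult=(is_adult == "1"),  # 0: non-adult title; 1: adult title
        start_year=_optional_int(start),
        end_year=_optional_int(end),
        runtime_minutes=_optional_int(runtime),
        genres=[] if genres == NULL else genres.split(","),
    )
```

Keeping the null handling in one small helper means a ‘\N’ in startYear, endYear, or runtimeMinutes becomes None instead of raising an error during integer conversion.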
- tconst (string)
- directors (array of nconsts) - director(s) of the given title
- writers (array of nconsts) – writer(s) of the given title
- tconst (string) - alphanumeric identifier of episode
- parentTconst (string) - alphanumeric identifier of the parent TV Series
- seasonNumber (integer) – season number the episode belongs to
- episodeNumber (integer) – episode number of the tconst in the TV series.
- tconst (string)
- principalCast (array of nconsts) – title’s top-billed cast
- tconst (string)
- averageRating – weighted average of all the individual user ratings
- numVotes - number of votes the title has received
- nconst (string) - alphanumeric unique identifier of the name/person
- primaryName (string) – name by which the person is most often credited
- birthYear – in YYYY format
- deathYear – in YYYY format if applicable, else ‘\N’
- primaryProfession (array of strings) – the top-3 professions of the person
- knownForTitles (array of tconsts) – titles the person is known for
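The file format described at the start of this section (gzip compression, tab-separated values, a header line, and ‘\N’ for missing values) can be read generically with the Python standard library. The sketch below assumes the files are UTF-8 encoded and that quote characters in titles are literal text rather than CSV quoting; neither is stated in the text above. The function name read_tsv_gz is hypothetical.

```python
import csv
import gzip

NULL_MARKER = r"\N"  # denotes a missing or null field


def read_tsv_gz(source):
    """Yield one dict per data row, keyed by the column names in the header.

    `source` may be a path to a local .tsv.gz file or a binary file object.
    Fields equal to the null marker are returned as None.
    """
    # Assumption: UTF-8 text; newline="" lets the csv module handle line endings.
    with gzip.open(source, mode="rt", encoding="utf-8", newline="") as text:
        # QUOTE_NONE treats quote characters as ordinary data (an assumption).
        reader = csv.reader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader)  # the first line holds the column headers
        for row in reader:
            yield {
                name: (None if value == NULL_MARKER else value)
                for name, value in zip(header, row)
            }
```

Because the function is a generator, a large file can be processed row by row, for example with for row in read_tsv_gz("title.basics.tsv.gz"), without loading it fully into memory. Array-valued columns are returned as raw strings and still need to be split by the caller.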